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Abstract 

We consider the problem of flexible modeling of higher order Markov chains when an 
upper bound on the order of the chain is known but the true order and nature of 
the serial dependence are unknown. We propose Bayesian nonparametric methodol¬ 
ogy based on conditional tensor factorizations, which can characterize any transition 
probability with a specihed maximal order. The methodology selects the important 
lags and captures higher order interactions among the lags, while also facilitating 
calculation of Bayes factors for a variety of hypotheses of interest. We design effi¬ 
cient Markov chain Monte Carlo algorithms for posterior computation, allowing for 
uncertainty in the set of important lags to be included and in the nature and order 
of the serial dependence. The methods are illustrated using simulation experiments 
and real world applications. 


Some Key Words: Bayesian nonparametrics. Categorical time series. Conditional 
tensor factorization. Higher order Markov chains. Sequential categorical data. 
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1 Introduction 

For t = 1,... ,T, consider a time indexed sequence of categorical variables {yt}- We 
assume that the distribution of yt may depend on the values at the previous q time 
points, i/t-i, ..., yt-q- For t = (g + 1),..., T, the transition probability law governing 
the evolution of the sequence satisfies 

p{yt I yt-i, ...,yi)= p{yt \ yt-i, ■ ■ ■,yt-q), 

and the likelihood function of the sequence admits the factorization 

T 

P(yi:T) =Po(yi:<?) n I 

t=(<J+l) 


where po denotes the distribution of the initial q variables yi:^; we follow common 
convention and condition on the initial observations to avoid modeling po- 

We call such a sequence a Markov chain of maximal order q if conditional on 
the values of (pf-i,..., yt-q), the distribution of yt is independent of its more distant 
past, but the actual lags important in determining the distribution of yt may be an 
arbitrary subset of (pt-i,..., yt-q)- In contrast, if the distribution of yt actually varies 
with the values at all the previous q times points, we call the sequence a Markov chain 
of full order q. The case q = 0 corresponds to serial independence. 

For a chain with Cq states, there are Cq — 1 free parameters in the conditional dis¬ 
tribution of yt, which can potentially vary arbitrarily with every possible combination 
of the levels of the previous variables. For a Markov chain of maximal order q, there 
are a total of such combinations, and hence the number of parameters in the full 
model is (Cq — This number increases exponentially in the order of the chain, 

creating estimation problems as q increases. It is very important to define flexible, 
parsimonious and interpretable representations, with unnecessary lags eliminated. 

A common approach to modeling higher order Markov chains is based on multi¬ 


nomial logit or probit models, with the lags included as linear predictors (Liang and 
Zeger, 1986 Zeger and Liang, 1986). Modeling order interactions among the lags 


using such models would require the inclusion of (^)(C'o — 1)^ interaction terms in 
the set of linear predictors. The number of interaction terms thus increases rapidly 
with Cq and q. For example, with only 5 lags and 4 categories, accommodation of 
second order interactions requires the inclusion of 90 interaction terms. In practical 
applications, attention is thus often restricted to only a small number of lags and low 
order interaction terms (|Fahrmier and Kaufmann 1987). 
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An alternative that can accommodate a relatively large number of lags but ig¬ 
nores interactions among lags is mixtures of transition distributions (MTD). In the 
basic MTD model (Raftery, 1985a), the transition probability p{yt \ yt-i, ■ ■ ■ ,yt-q) is 
a linear combination of Q{yt-i,yt), ■ ■ ■ ,Q{yt-q,yt), where Q is a Co x Co transition 
matrix for a hrst order Markov chain. [Raftery (1985b) and Berchtold ( |1995 , 1996) 
allowed different transition matrices for different lags. Raftery and Tavare (1994) and 


Berchtold and Raftery (2002) discussed estimation algorithms and other generaliza¬ 


tions. While MTD leads to parsimonious models for higher order Markov chains, it 
is not structurally rich, particularly when size of the state space and/or the order of 
the chain is large. Additionally, the model implicitly assumes the process is of full 
order g, with selection of q requiring rehtting for different choices. 

Another popular strategy to modeling higher order Markov chains is based on 
trees with conditioning sequences of different lengths as nodes and leaves. Variable 
length Markov chains (VLMC) dBiihlmann and Wyner 1999; Ron et al. , |1996 ) prune 
large branches, keeping only those nodes whose effects on yt are different enough 


from their parent’s. Context tree weighting (Willems et al. 

1991 

)) uses an ensemble 

of trees of varying depths. The sequence memoizer (Teh 

2006; 

Wood et al, 2011) 

uses a hierarchical prior to center the children p{yt yt-i,... ,yt-r) around their 


parent p{yt \ yt-i, ■ ■ ■ ,yt-{r-i)) for each r > 1, which favors a restrictive structure. 
In general, tree based methods are not suitable when a more distant lag may be a 
more important predictor of yt than a relatively recent one. Sparse Markov chains 
(SMC) (Jaaskinen et ai, 2014) attempt to remove this limitation. SMCs cluster the 
lag combinations having similar influence on the transition distribution of yt, related 
to VLMC but leaving the partitioning unrestricted. Such hard clustering may lead 
to oversimplihcation of the dependence structure for long sequences. Additionally, 
hard clustering and tree based approaches do not explicitly characterize signihcance 
of individual lags or provide a framework for testing of related hypotheses. 

In this article, we take a fundamentally different approach. Tensor factoriza¬ 
tions for categorical regression have been developed in Yang and Dunson (2015). We 
adapt these factorizations to our dynamic setting, while incorporating substantial 
improvements to the structure and computation. The proposed formulation leads 
to parsimonious representations of transition probability tensors, shrinking towards 
low dimensional structures and borrowing strength across lags, while being flexible 
in capturing complex higher order interactions. The method allows automated or¬ 
der and lag selection, quantifying uncertainty in selection and facilitating testing of 
hypotheses. Convergence of the posterior to the true transition probability tensor is 
guaranteed under ergodicity of the true data generating process. Taking a novel ap¬ 
proach to posterior computation in variable dimension models, we develop an efficient 
Markov chain Monte Carlo (MCMC) algorithm. 
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The article is organized as follows. Section details onr model and its interesting 
aspects. Section [^describes MCMC algorithms to sample from the posterior. Section 
[^presents the results of simulation experiments comparing our method with existing 
approaches. Section presents some applications of the proposed method. Section 
contains concluding remarks. 


2 Model Specification 


2.1 Review of Tensor Factorizations 

There is a vast literature on tensor factorizations, the two most popular approaches 
being parallel factor analysis (PARAFAC) and higher order singular value decomposi¬ 


tion (HOSVD). PARAFAC (Harshman, 1970) decomposes a. DiX - ■ - xDp dimensional 
tensor M = as the sum of rank one tensors as 


m 




'^9hWu\^\xj) 


( 1 ) 


h=i j=i 


In contrast, HOSVD, proposed by Tucker (1966) for three way tensors and extended 


to the general case by De Lathauwer et al. (2000), factorizes M as 


fci 


m. 


Xl,.. 


E-E 9hi,...,hp 


hi = l 


U 


U), 


Xi 


( 2 ) 


i=i 


where G = {ghi,...,hp}) called a core tensor, captures interactions between the differen; 
components and Uj = component specihc weights. See Figure 

HOSVD achieves better data compression and requires fewer components comparec 
to PARAFAC, which can be obtained as a special case of HOSVD with G diagonal. 
The tensor factorization that is most relevant to our problem was introduced in 


Yang and Dunson (2015) (YD). YD considered the problem of regressing a categor¬ 
ical response variable y G {1,..., Dq} on categorical predictors Xj G {1,..., D^}, 
j = 1,... ,p. Structuring the conditional probabilities p{y \ Xj,j = 1,... ,p) as the 
elements of a Do x Di x ■ ■ ■ x Dp dimensional tensor, YD proposed the following 
HOSVD-type factorization 


fci 


'P 

p{y I Xj,j = 1,... ,p) = ^ ^ K-hp{y)YlT^h^{xj), 




( 3 ) 


^1 = 1 
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Figure 1: Pictorial representation of HOSVD for a 3 way tensor M with core tensor 
G and weight matrices Uj,j = 1, 2, 3. 

where 1 < kj < Dj for j = 1,... and the parameters Xhi,...,hp{y) and are all 

non-negative and satisfy the constraints (a) J2y=i Xhi...hp{y) = 1 for each combination 
(hi,..., hp), and (b) Y^h =i 1 for each pair {j,Xj). They established that 

any conditional probability tensor can be represented as ([^, with the parameters 
satisfying the constraints (a) and (b). The constraints (a) and (b) are thus not 
restrictive but they ensure that Yy=iP(y I = 1) • • • )P) = 1- 

Taking a Bayesian approach, they assigned sparsity inducing priors on the k/s 
and conditional on the fcj’s, placed independent Dirichlet priors on Xhj^^,,,^hp{yys and 
as , Xh^^...,hp{Do)} ~ Dir(l/T)o, • • •, I/-D 0 ) for each combina¬ 

tion (hi,..., hp) with 1 < hj < kj and {7r[^\xj ),..., 'x^^^ixj)} ~ Dir(l/hj ,... ,1/kj) 
for each xj G {1,..., Dj}. The dimensions of these parameters vary with h^-’s, mak¬ 
ing the design of efficient MCMC algorithms challenging. YD used an approximate 
two-stage sampler, selecting the h/s in the hrst stage and then sampling the other 
parameters in the second stage while keeping the h/s hxed. 

2.2 Higher Order Markov Chains via Tensor Factorization 

We propose a nonparametric Bayes approach for inferring the order and structure 
of higher order Markov chains building on a YD-type conditional tensor factoriza¬ 
tion. In our dynamic setting, we have a time-indexed categorical sequence {yt} 
with hnite memory of maximal order q taking values in the set {1,... ,Co}. Given 
yt-i, ■ ■ ■, yt-q: the distribution of yt is independent of all observations prior to t — q. 
The variables that are important in predicting yt can potentially constitute a subset of 
{yt-i ,..., yt-q}- For f = g -Fl,..., T, the transition probability p{yt \ yt-i, yt-q) 
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is structured as a Cq x Cq x • • • x Cq dimensional tensor and admits the factorization 


fcl kg q 

pivt I yt-jj = i,...,q) = Xh^,...,hgiyt) n (4) 

hi=l hg=l j=l 

where, with some repetition, 1 < kj < Cq for all j and the parameters Xhg,...,hp{,yt) 
and 7rl^^{yt_j) are all non-negative and satisfy the constraints 


Co 

^hu...,hg {yt) = 1, for each combination (hi,..., hg), (5) 

yt=i 

kj 

X] (ht-i) = 1, for each pair {j,yt-j). (6) 

hj=i 

Introducing latent allocation variables Zj^t for each j = 1,..., g and f = g-|- 
1,... ,T, the response values are conditionally independent and the factorization can 
be equivalently represented through the following hierarchical formulation: 


iyt I Zj^t = hj,j = l,...,q) ~ Mult({l,...,Co},A/ii,„„^(l),...,Afe,,..„/j^(C'o)),(7) 
izj,t I yt-j) ~ Mult({l,.. ., kj}, 7r[^\yt-j),... {yt-j)). (8) 

See Figure Posterior computation is facilitated by sampling these latent auxiliary 
variables. This formulation also aids in understanding interesting features of the 
model. Equation ([^, for instance, reveals the soft clustering property of the model 
that enables it to borrow strength across the different categories of yt-j by allowing 
the Zj^t^ associated with a particular state of yt-j to be allocated to different latent 
populations, which are shared across all Cq states of yt-j- Equation ([^, on the 
other hand, shows how such soft assignment enables the model to capture complex 
interactions among the lags in an implicit and parsimonious manner by allowing 
the latent populations indexed by (hi,..., hq) to be shared among the various state 
combinations of the lags. 

When kj = 1, Ti^^\yt-j) = 1 and p{yt \ yt-i, yt-q) does not vary with yt-j. The 

variable kj thus determines the inclusion of the lag yt-j in the model. The variable 
kj also determines the number of latent classes for the lag yt-j- The number of 
parameters in such a factorization is given by (Co — 1) H j=i kj Co J2'j=ii^j ~ f)) 
which will be much smaller than the number of parameters (Co — 1 )Cq required to 
specify a full Markov model of the same maximal order if n ■=! kj < CqT 

In practical applications 11^=1 t)e quite large. For instance, for a 
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Figure 2: Graphical model depicting the dependence structure of a second order 
Markov chain {yt\ (a) without and (b) with latent variables {zj^t}- 

Markov chain with Gq = 4 states and 5 important lags with kj = 3 for all j = 1,..., 5, 
the number of parameters required to specify the core tensor will be 3 x 3^ = 729. 
While this results in a signihcant reduction in the number of parameters compared to 
a fully specihed Markov model of the same maximal order which requires 3 x 4^ = 3072 
parameters, it may still be too large for efficient and numerically stable estimation of 
the parameters for data sets of sizes that are typically encountered in practice. 

Towards a more parsimonious representation, we note that the conditional ten¬ 
sor factorization (|^ can be interpreted as a predictor dependent mixture model 
for modeling distributions supported on {l,...,Go}. Here the probability vectors 
^hi,...,hq = ..., \hi,...,hp{Co)} that constitute the core tensor play the role 

of kernels of the mixture model, and 7 rhi,...,hg {vt-i, • • •, Vt-q) = 11^=1 iVt-j) play the 
role of associated predictor dependent mixture weights. Given ki,... ,kq, the kernels 
are indexed by (hi, ..., hq) with hj = 1, ... ,kj, contributing nj=i mixture compo¬ 
nents to the model. Thus, the number of kernels determines the effective dimension of 
the model. In most applications, a very large number of kernels may not be required. 
This is especially true for discrete distributions supported on a finite set {1,..., Gq}. 

A more parsimonious representation that retains the ffexibility of the original 
model is obtained by encouraging the kernels Xhi,...,hq to be shared amongst the label 
combinations (hi,..., hq) through probabilistic clustering. Specifically, we let 

OO 

E independently for each (hi,..., hq), (9) 

e=i 

= {A£(1), ..., A£(Go)} ~ Dir(a,..., a), independently for £= 1,..., cx), (10) 
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vr^ = (1 — Kn), 14 ~ Beta(l,ao), independently for £ = 1,..., cx). (11) 

m=l 

Introducing latent variables z* and for t = hj = j = 

1,... ,q, we further have 


Pi^hi,...,h = ^) = independently for each (hi,hq), (12) 

{^hl,...,hq I ^hl,...,hq = ■^) = (^i I Zj,t = = 1, . . . , O') = (13) 

( 1/1 I z* = £) ~ Mnlt({l,..., Co}, A*(l),..., A*(Co)). (14) 


See Figure The cluster inducing prior specihed through ([^-([TT]) corresponds to a 
Dirichlet process (DP) prior ( Fergusonf 1973) written in terms of its stick-breaking 
representation (Sethuraman, 1994). Although the prior allows inhnitely many com¬ 
ponents, the number of components occupied by the 11^=1 mixture kernels is hnite 
and likely much smaller than leading to a signihcant reduction in the ef¬ 

fective number of parameters of the model. Our experiments suggest that, even in 
low to moderate dimensional problems, such clustering of kernels greatly improves 
numerical stability compared with assigning continuous priors on the kernels. The 
idea of hierarchical sharing of the kernels constituting the core tensor is not specihc to 
our dynamic setting, and can be easily adapted to other tensor factorization models 
including the original YD model, also eliminating problems with exceeding limited 
storage space that plague YD in applications we have considered. 



Figure 3: Graphical model depicting the dependence structure of a second order 
Markov chain {yt} with latent variables {^yt} and {z*}. 


The DP prior on the mixture kernels Xhi,...,hq treats the kernel indices (hi,..., hq) 
as exchangeable. Although it is conceptually appealing to favor clustering of ker¬ 
nels with similar indices [hi,..., hq), there is an associated signihcant increase in 
model complexity and computational costs. Hence, given that exchangeable priors 
worked well in examples we considered, we did not consider such modihcations fur¬ 
ther. Although we focused on a DP prior for simplicity, other exchangeable clustering 
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priors, such as Pitman-Yor processes ( jPitman and Yor , 1997) or probit stick-breaking 
processes ( Rodriguez and Dunsonf 2011), can be used. 

Next, consider mixture probability vectors ..., (i/t-j)}. 

The dimension of 7r^^\yt-j), unlike the varies linearly with kj. For a Markov 

chain with |/o = 4 states and 5 important lags with kj = 3 for all j = 1,..., 5, the 
number of parameters contributed by all the 7r^^\yt-j) vectors will only be 4 x 2 x 5 = 
40. We thus assign independent priors on Tz^.^yt-j) as 


(2/t-i) = 


rU)l 


r(i)/ 


Dir(7j,...,7j). 


(15) 


The probability vectors 7T^^\yt-j) are supported on {1,..., for each pair (j, yt-j)- 


Therefore, unlike for Xhi,...,hq, conditioning on kj, which we have kept implicit in (15), 


can not be avoided. We, however, do not allow the hyper-parameter jj to vary with 
kj. This has important consequences in the design of our MCMC sampler. Details 


are deferred to Section 13.11 

Finally, model specihcation is completed by assigning priors on k. We assign 
independent priors on fc^’s as 


p(kj = k) = Poj{k) oc exp(-(pjfc), j = 1,... ,q, k = 1,... ,Co, (16) 


where cp > 0. The prior assigns increasing probabilities to smaller values of kj as 
the lag j becomes more distant, reflecting the natural belief that increasing lags will 
have diminishing influence on the distribution of yt- Larger values of ip imply faster 


in the Supplementary Materials. The model space can be restricted to the class of 
Markov models of full order by restricting the prior to satisfy the condition that 
kj= 1 whenever kj = 1 for some j. It is appealing to avoid such restrictions to 
accommodate scenarios where a more distant lag is an important predictor of yt but 
a lag in the more recent past is not. As illustrated in Section such scenarios are 
often encountered in practice. 

Let y = [yt : t = z = {zj^t ■ t = = 1,... ,g} and z* = 

{^hi,...,hq • — 1) • • •) J = 1) • • •) *?}) where t* = {q + 1). Collecting all potential 
predictors of yt in w* = {wpt, ..., Wqj)'^ with Wj^t = yt-j for j = 1, ..., g and t = 
t*,. .. ,T, the joint distribution of y, z and z* admits the factorization 


decay of Poj{k) with increase in j and k, favoring sparser models. See Figure S.l 


p(y, z, z* I A*, 77*, TTk, k) = JJ <{ p{yt I AY ) Ylp{zj,t \ Wjj, kj) }> JJ JJ p{zl^,...,hq I t^*) 


q Kj 


t=t* K J = 1 


j=l hj=l 















( 17 ) 


T 

= I I Wt,7rk,k)} p{z* I 77*) 

t=t* 

5 

= p{y I Z,Z*,A*) JJp(Zj I Wj,7r\^^,kj) p{z* I 77 *). 
i=i 

Here k = {kj : j = 1,..., q}, TT^^^iwj) = {vr[^,^K) : hj = 1,..., kj}, 77^^? = {77j;^?(w^) : 

Wj = 1,..., Co}, TTk = {7^kj : i = 1, • • •, 9}- Also, Zt = {Zj^t : j = 1,..., g} for all t = 
t*,...,T, Zj = {zj^t : t = t*,... ,T} for j = 1,... ,q and Wj = {wj^t : t = t*,... ,T}. 
Here A* = {A^ : i = 1,..., cxd} and 77 * = {tt^ : i = 1,..., cx)} collect respectively the 
atoms and the probabilities of distribntion (|^. The snfhxes k and kj signify that the 
snpports and hence the dimensions of the associated parameters depend on them. 

The proposed model is nonparametric in the sense that it assigns positive proba¬ 
bility to neighborhoods of the trne data generating process. Let Pq denote the trne 
transition probability tensor. Also let d denote the Li distance between two transition 
probability tensors P and Pq defined as 


Co Co Co 

d{P,Po) = I -Po{y I Wl,...,Wq)\. 

Wl=l Wq = l y=l 


Let H denote the prior on the space of all transition probability tensors indnced 
throngh the proposed model and n(- | yi:T) denote the corresponding posterior based 
on an observed seqnence yi:T of length T. The following resnlt establishes posterior 
consistency of the proposed model nnder the mild assnmption of ergodicity of the trne 
data generating mechanism by showing that n(- | yi:T) concentrates in arbitrarily 
small Li neighborhoods of Pq as the seqnence length approaches inhnity. 


Theorem 1. If the true data generating process is an ergodic Markov chain of max¬ 
imal order q, then for any 5 > 0, H {P : d{P, Pq) > 6 \ yi:r} — )■ 0 as T — )■ cx) almost 
surely Pq. 


The proof, deferred to Appendix A[ follows along the lines of the proof of Theorem 
4.3.1 of Ghosh and Ramamoorthi (2003) and ntilizes strong law of large nnmbers for 
ergodic Markov chains. 

We conclnde this section by comparing the proposed approach to sparse Markov 
chains (SMC) (Jaaskinen et ai, 2014). The SMC model can be formnlated as 


k 

pivt I yt-jj = i,...,g) = Y^h{yt)Trh{yt-u ■ ■ ■ ,yt-q), ( 18 ) 

h=l 
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where Ylyh Mv) = 1 for h = 1 ,..., fc, TThivt-i, • • •, Vt-q) = 1 if ivt-i, • • •, Vt-q) e St 
and 0 otherwise, and forms an nnrestricted partition of the set of all possi¬ 

ble valnes of the conditioning sequence {yt-i, ■ ■ ■ ,yt-q)- Equation (18) is a predictor 


dependent mixture model induced via a PARAFAC-type conditional tensor factor¬ 
ization. Introducing latent variables Zt, the model can be reformulated as 


(yt \ zt = h) 
{zt I yt-j,j = 


Mult({l,...,C'o},A,(l),...,A,(C'o)), 

Mult({l,..., /c}, ni{yt-i ,..., yt-q), ■ ■ ■, Trk{yt-i, • • •, yt-q))- 


By forcing Zt^ = Zt^ whenever {yt^-i,yt^^-q) = {yt 2 -i, • • •, yt 2 -q), the model only al¬ 
lows a restrictive hard clustering of the mixture kernels. Additionally, the assumption 
of conditional independence of yt and {yt-i, ■ ■ ■, yt-q) given a single latent variable Zt 
is quite restrictive, and precludes inferences on the importance of individual lags. Ap¬ 
proaches making a similar assumption in the continuous time series literature, such 


as the model of Di Lucca et al. (2013), face similar disadvantages. 


The model proposed in this article is based on a more general HOSVD-type con¬ 
ditional tensor factorization. The mixture components are indexed by a vector of 
indices (hi,..., hg), not a single scalar index h, and the mixture probabilities admit 
a further decomposition as T^hi,...,hq{yt-ii ■ ■ ■ ,yt-q) = 11^=1 The latent vari¬ 

able formulation, given by ( 0 -(§. thus introduces a separate latent cluster indicator 
variable Zj^t for each lag yt-j- This allows explicit characterization of the importance 
of individual lags through the variables k/s. The variables Zj^ts are allocated to dif¬ 
ferent clusters in a soft probabilistic manner - for ti 7 ^ ^ 2 , Zj^n and Zj^t 2 are allowed to 
take different values even when yt^-j = yt 2 -j- The cluster inducing DP prior on the 
mixture kernels Xhi,...,hq provides opportunities for an additional layer of dimension 
reduction. These features enable the proposed model to capture complex serial depen¬ 
dence structures with greater parsimony, making it better suited to high-dimensional 
applications while also facilitating automated lag and order selection. 

3 Estimation and Inference 


3.1 Posterior Computation 

A mixture of hnite mixture (MFM) model has a hnite but unknown number of mix¬ 
ture components. Our proposed model can be viewed as a sophisticated dynamic 
MFM model, with lag-specific number of mixture components kj and mixture prob¬ 
abilities 7T^^\wj). The dimension of varies with k. The most common approach 
to posterior computation in variable-dimensional mixture models is reversible jump 


MCMC (Richardson and Green, 1997). Alternative algorithms include the allocation 
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sampler (Nobile and Fearnside, 2007) and birth-death MCMC (Stephens, 2000). It is 
difhcnlt to design efficient implementations of snch algorithms including in our set¬ 
ting. To bypass this problem, Yang and Dunson (2015) developed an approximate 


two-stage sampler. In the hrst stage, a stochastic variable search algorithm (George 


and McCulloch, 1997) based on an approximated marginal likelihood was used to 


estimate the set of important predictors and corresponding values of fc/s. In the 
second stage, samples of Xhi,...,hp and T^h^ixj) were drawn from their closed form full 
conditionals conditionally on the estimated k/s. 

Recently Miller and Harrison (2015) studied MFM models in the univariate iid 
case drawing parallels with inhnite mixture models. Using a Dirichlet prior on the 
mixture weights = (tti, ..., tt^) ~ Dir(7 ,..., 7), they integrated out and k 
to obtain closed form expressions for the induced cluster conhgurations. Mimicking 
MCMC algorithms for Dirichlet process mixture (DPM) models, they developed a 
sampler that iterates between updating the cluster conhgurations conditional on the 
other parameters and then updating the other parameters conditional on the cluster 
conhgurations, bypassing the problem of dehning proposals for the variable dimen¬ 
sional parameter tt^. The ability to marginalize is largely unique to simple Dirichlet 
mixture models, precluding a straightforward adaptation of their algorithm to our 
model. However, we were able to generalize their algorithm to not require marginal¬ 
ization by explicitly sampling k from the posterior. Given k, is of hxed dimension 
and all parameters can be easily updated using standard techniques. The sampling 
of k is thus a key innovative step in our sampler and we outline below how we do 


this. Technical details are deferred to Appendix B 


When the hyper-parameter 7^ does not depend on kj, it is possible to integrate 
out to obtain closed form expressions for p{kj \ Zj,Wj). Specihcally, we have 




p{kj 


= 




nj,i, 


nj,co imaxzj, 


kj = maxzj,..., Co, 


(19) 


where ,.= E<2,Po,l(*:)nSi{r(*:7j)/r(*:7, +i!j,r)} and n,,, denotes the 

frequency of the category of the predictor Wj. By marginalizing out TTk and 
exploiting conditional independence relationships amongst diherent variables, we can 
show that the conditional distribution p(k | y, z, z*. A*, tt*) equals p(k | z,w) = 
YYj=iP{^j I This also follows easily by noting that the Markov blanket of kj, 

after TTk have been integrated out, comprises precisely Zj and w^. This allows us to 
design a collapsed Gibbs sampler that iterates between sampling 7r,z,z*, A* and tt* 
from their closed form full conditionals, and then sampling k from the closed form 
collapsed conditionals p(k \ y, z, z*. A*, tt*). 

To update the parameters A*, tt* and z* of the DPM model, we used the approach 
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of Ishwaran and James (2001), truncating the stick-breaking representation of the 
Dirichlet process prior on the mixture kernels Xhi,...,hq at the component. In the 
examples that we considered in this article, the maximum number of categories was 
4. We set L = 100, which sufficed for modeling conditional probability distributions 
supported on at most 4 categories. 

We are now ready to describe our sampler. In what follows, denotes a generic 
variable that collects the variables that are not explicitly mentioned, including the 
data points y. The sampler iterates between the following steps. 


1. For each (hi,..., hg), sample from its multinomial full conditional 


Co 

^ I C) OC TT^* 

y=l 

where nh,,...,h,{y) = EL* = hi,..., Zg,t = hg, yt = y]. 

2. For ^ = 1,... ,L, sample Vn from their beta full conditionals 


p(W I C) = Beta(l + n^,ao + Efc>r^fc)> 

where and update tt* accordingly. 

3. For ^ = 1,... ,L, sample from their Dirichlet full conditionals 


{AKl),...,AKCo)}|C 


~ Dir {a ... ,a + ?7,^(C'o)} , 


where n^( 2 /) = E(hi,...,^) 

4. For i = 1,... ,q and for = Co, sample 

Tik-iwj)} I C ~ Dir{7j + (1),..., (kj)}, 

where nj>^.(hj) = EL* = hj,Wj,t = Wj}. 

5. For j = 1,... ,q and for t = t*,... ,T, sample the Zj^tS from their multinomial 
full conditionals 


p{zj^t = h\C, 


he,i^j) oc 




n.j + 1 , 


(yt)- 
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6. Finally, for j = 1,..., g, sample the using their multinomial full conditionals 


P{kj I C) =p{kj I Zj,Wj) 


po,j{kj) nSi 


r{kj'yj+nj^r) 


U, 




(max Zj) 


kj = maxzj,..., Cq. 


The conditional probability p{kj \ Zj,Wj) depends on zj and wj only through 
maxZj and fhe frequencies of different categories of Wj. For a given data set, 
rij^s are hxed quantities and maxzj G {1,..., Cq). The values of (^) 

different possible values of z, and hence the distribution p{kj \ Zj,Wj), can thus be 
precomputed and stored before running the sampler. 

Using Stirling’s approximation F(n + a)/T{n) ^ n", for moderately large values 
of nj^r^ we can use 


P{kj 


y^C'o po,jW ttCq 

Z^£=maxzj po,j(kj) f fr’=l ^j,r 


kj = maxzj ,..., Cq. 


To facilitate convergence, we initialize the component allocation variables z at the 
cluster allocation values returned by an approximate two-stage sampler designed along 
the lines of Yang and Dunson (2015), see Section S.2 in the Supplementary Materials. 
In experiments with synthetic and real data sets, 50, 000 MCMC iterations with the 
initial 10, 000 discarded as burn-in produced stable results, with trace plots and plots 
of running means and quantiles suggesting no convergence or mixing issues. To reduce 
autocorrelation, we thinned the post burn-in samples taking every 5*^ value. 


3.2 Testing 

In many applications, it is of interest to test for the order of the Markov chain and 
the importance of a particular lag. The explosion in the number of parameters as the 
order increases and paucity of data in many applications have forced the literature on 


nonparametric tests of hypotheses to focus mostly on low order Markov chains (Avery 

and Henderson 

1999; Quintana and Newton, 1998 Xie and Zimmerman 

2014 BesagI 

and Mondal 

2013 

). A particularly attractive feature of our tensor factorization based 


approach is that many such hypotheses of interest can be expressed in terms of the 
variables k. For instance, the hypothesis that the lag is important is equivalent 
to Hq : kj > 1] the hypothesis that the chain is of maximal order go translates to 
^0 ■ kqg > 1 and kj = 1 for j > go; the hypothesis that the chain is of full order go 
can be expressed as Ffo : > 1 for j = 1 ,..., go and kj = 1 for all j > go etc. 

Following Nobile (2004), we use the number of non-empty components or clusters 
formed by the latent class allocation variables to estimate the number of mixture 
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components. Denoting the number of clusters formed by zj by kj, we thus say that the 
jth important predictor of yt if and only if kj > 1. The hypotheses described 

above can be reformulated in terms of the fc^-’s accordingly. Clearly, 1 < kj < kj < Cq. 
The soft clustering aspect of our model makes it difficult to determine the induced 
prior on kj. However, the prior probabilities allocated to the different Hq's described 
above depend only on the prior probabilities of the important special cas es kj = 1. 
Exploiting the symmetry of the Dirichlet prior on and using equation (A.4) from 
Appendix B[ these probabilities can be easily obtained as 


Cq k Cq Cq ^ -p / \ "p / ^ \ 

Po(kj = 1) = ^Pojik) '^pizjj = t\kj = k) = ^ n I r( ^4 + n’>) 


k=l 

l=l 

( Co 

IIT 

r=l ) 

L k=l 


k=l 


Poj{k)k 


it 


( 20 ) 


For moderately large values of Uj^r, Stirling’s approximation can be used to obtain a 
simpler formula. For large values of rij^r, (20) will be close to po^kj = 1) = Poj(l)- 


To conduct Bayesian tests for the different hypotheses described above, one may 


rely on the Bayes factor (Kass and Raftery, 1995) in favor of Hi against Hq given by 


BFiq — 


p{Hi I y)/p{Hi) 
p{Ho I y)/p{Ho)' 


which can be easily estimated based on the output of the Gibbs sampler described 
in Section 3.1 with p{Ho \ y) and p{Hi \ y) equal to the proportion of samples in 


which the kj^s conform to Hq and Hi, respectively. Results of simulation experiments 
evaluating performance of the Bayes factor based tests are summarized in Section 


4 Simulation Experiments 

We designed simulation experiments to evaluate the performance of our method in 
estimating various aspects of the transition dynamics in a wide range of scenarios. 
Some of the cases were generated to closely mimic the real data sets that we an¬ 
alyzed in Section]^ We consider the cases (A) [4, {1,2,3}], (B) [3, {1,2,3}], (C) 
[4, {1,2,4}], (D) [3, {1,2,4}], (E) [4, {1,3,5}], (F) [3, {1,3,5}], (G) [3, {1,4,8}], and 
(H) [2, {1,4,8}], where [Gq, {A,..., A}] means that the sequence has Cq categories 
and {vt-ii, ■ ■ ■ ,yt-ir} are the true important lags. In each case, we considered two 
sample sizes T = 200, 500 and generated an additional N = 500 test data points 
to evaluate prediction performance. The maximal order q of the chain was chosen 
to be two more than the most distant important lag, namely q = 5,6,7 and 10, 
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respectively. To generate the true transition probability tensors, for each combina¬ 
tion of the true lags, we hrst generated the probability of the hrst response category 
as f{Ui) = U^/{Uf -|- (1 — with Ui ~ Unif(0,l). The probabilities of the 

remaining categories are then generated via a stick-breaking type construction as 
f{U 2 ){l — f{Ui)} with U 2 ~ Unif(0,1) and so on, until the next to last category 
(Co — 1) is reached. The hyper-parameters were set at a = I/Cq, ao = 1 and 
= 1/Co for all j. We prescribe using 75 = 1 / 2 , which produced good results in syn¬ 
thetic data sets and real applications, as a default value for (p. The reported results 
are based on 100 simulated data sets in each case. We coded in MATLAB. For the 
case (G) described above, with Co = 3 categories and T = 500 data points, 50, 000 
MCMC iterations required approximately 30 minutes on an ordinary laptop. 

We compared our approach with a multinomial logit model that includes the lags 
of order up to q as linear predictors and ignores interactions, a variable length Markov 
chain (VLMC) model, a sparse Markov chain (SMC) model, and a mixture transition 


distribution (MTD) model. We also included a simple random forest (Breiman 


based model (RFMC) which, like VLMC, is also tree-based but, unlike VLMC, does 
not enforce a strict top-down search. The multinomial logit model and the RFMC 
model were implemented using respectively the VGAM (Yee, 2010) and the random- 
Forest (Liaw and Weiner, 2002) packages in R. The VLMC model was implemented us¬ 
ing the R package VLMC with the pruning parameter selected using the AIC criterion 
(Machler and Biihlmann, 2004). The SMC and the MTD models were implemented 
using MATLAB codes downloaded from http://www.helsinki.£/bsg/£ler/SMCD.zip 
and http://hb.stat.cmu.edu/matlab/GMTD, respectively. Instead of rehtting the 
MTD and the SMC models with di£erent possible choices for the maximal order, 
we set the maximal order at the corresponding true value, giving these models an 
undue advantage over others. 

Performance in estimating the transition probabilities and predicting one step 
ahead response values are summarized in Table and Table respectively. The 
average Li errors were estimated by Y^=t+i Ey=i 1^0(1/ I yt-i,---,yt-q) - P{y I 
yt-i,... ,yt-q)\l{C qN), where Pq and P are the true and the estimated transition 
probability tensors, respectively. The proposed model performed competitively with 
VLMC and SMC approaches when the maximal orders were small and there were no 
gaps in the set of important lags. When the true maximal orders were increased and 
lag gaps were introduced, the proposed approach vastly outperformed all competitors. 
In the latter cases, the VLMC method, which employs a tree based top-down approach 
to determine serial dependencies, fails to eliminate the unimportant intermediate lags 
leading to its poor performance. The SMC and the RFMC methods can accommodate 
lag gaps and in these cases their performances were generally superior to that of the 
VLMC model. However, their strategy of hard clustering large conditioning sequences 
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becomes increasingly ineffective as the true maximal order increases and the proposed 
conditional tensor factorization based approach starts to vastly dominate. It is to be 
noted that our implementation of the SMC method assumed the true maximal order 
to be known in each case. The approach outlined in Xiong et al. (2015) to determine 
the optimal order did not work well in our experiments, producing signihcantly worse 
results. The MTD approach had poor performance in all cases, likely due to its 
restrictive assumption of simple additive effects of different lags. 


Truth 

Sample Size 


Average Li 

Error xlOO 


MLGT 

VLMG 

SMG 

RFMG 

MTD 

GTF 

(A) 4, {1,2, 3} 

200 

20.14 

11.59 

9.77 

12.54 

20.45 

12.67 

500 

19.99 

7.31 

6.95 

10.70 

20.16 

7.66 

(B) 3, {1,2, 3} 

200 

20.60 

9.24 

8.54 

12.59 

22.14 

11.12 

500 

19.74 

5.07 

5.21 

10.59 

21.07 

5.56 

(C) 4, {1,2,4} 

200 

20.19 

16.24 

14.66 

14.42 

20.84 

13.81 

500 

19.40 

11.09 

9.61 

11.35 

20.10 

7.79 

(D) 3, {1,2,4} 

200 

22.05 

13.59 

11.22 

13.41 

23.45 

11.09 

500 

21.38 

8.33 

7.26 

11.03 

22.74 

6.12 

(E) 4, {1,3, 5} 

200 

21.39 

20.51 

18.93 

16.62 

21.56 

15.30 

500 

20.74 

16.66 

14.89 

13.31 

20.91 

8.24 

(F) 3, {1,3, 5} 

200 

23.57 

18.83 

16.12 

15.67 

24.46 

11.58 

500 

22.69 

13.18 

11.24 

12.05 

23.64 

6.38 

(G) 3, {1,4, 8} 

200 

23.86 

24.92 

26.82 

18.24 

24.42 

12.00 

500 

23.41 

22.33 

24.04 

14.41 

24.18 

6.60 

(H) 2, {1,4, 8} 

200 

19.16 

20.14 

17.53 

13.31 

21.80 

7.17 

500 

16.69 

13.07 

11.07 

9.99 

19.74 

3.62 


Table 1: Average Li distances between the true and the estimated transition proba¬ 
bility tensors for our conditional tensor factorization (CTF) based approach com¬ 
pared with a multinomial logit model (MLGT), a variable length Markov chain 
(VLMC) model, a sparse Markov chain (SMC) model, a random forest based (RFMC) 
model, and a mixture transition distribution (MTD) model. In the hrst column, 
Co, {A,..., v} means that the sequence has Cq categories and {yt-ii, • • •, Ut-ir} s-re 
the true important lags. See Section for additional details. The minimum value in 
each row is highlighted. 


Figure summarizes the results produced by our method for the case (G) 
[3, {1,4, 8}] with T = 500 data points for the data set corresponding to the me¬ 
dian classihcation error rate. Panels (c)-(e) in Figure [^illustrate the method’s ability 
to identify the important lags and the maximal order of the chain. 

To assess testing performance, we considered the hypotheses (a) Hq : h > 1, (b) 
Hq : > 1, (c): Hq : kg > 1 and kg = kig = 1, and (d) Hg : kj > 1 for j = 1,4, 8 and 

kj = 1 otherwise for the case (G) [3, {1,4,8}] described above with 500 data points. 
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Truth 

Sample Size 

Classification Error Rates xlOO 

MLGT 

VLMC 

SMC 

RFMC 

MTD 

CTF 

(A) 4, {1,2, 3} 

200 

50.39 

36.50 

33.45 

35.82 

51.40 

35.37 

500 

50.73 

31.27 

29.83 

33.39 

51.54 

30.05 

(B) 3,{1,2,3} 

200 

40.28 

26.42 

24.53 

27.51 

42.53 

26.75 

500 

37.86 

22.43 

22.88 

23.30 

39.84 

21.99 

(C) 4, {1,2,4} 

200 

50.52 

44.03 

42.36 

39.47 

51.23 

37.05 

500 

48.22 

35.14 

35.53 

32.87 

49.26 

28.95 

(D) 3, {1,2,4} 

200 

41.67 

30.28 

27.14 

28.69 

43.94 

25.19 

500 

40.70 

25.22 

23.82 

24.70 

43.03 

22.07 

(E) 4, {1,3, 5} 

200 

52.76 

51.28 

42.36 

42.33 

52.88 

38.35 

500 

50.77 

44.95 

38.38 

37.43 

50.98 

29.90 

(F) 3, {1,3,5} 

200 

45.26 

37.62 

32.39 

31.70 

45.59 

25.59 

500 

42.88 

30.26 

27.28 

26.11 

43.78 

22.25 

(G) 3,{1,4,8} 

200 

45.95 

47.23 

47.27 

35.36 

45.99 

26.86 

500 

45.00 

43.68 

43.59 

29.78 

46.11 

22.90 

(H) 2, {1,4, 8} 

200 

25.23 

26.78 

23.97 

18.77 

26.93 

14.55 

500 

22.37 

19.89 

18.42 

15.02 

24.88 

13.78 


Table 2: Classification error rates of the conditional tensor factorization (CTF) based 
approach in predicting one step ahead response values compared with a variable 
length Markov chain (VLMC) model, a sparse Markov chain (SMC) model, a random 
forest based (RFMC) model, and a mixture transition distribution (MTD) model. 
Iln the first column, Co, {R,... ,ir} means that the sequence has Co categories and 
are the true important lags. See Section for additional details. 
The minimum value in each row is highlighted. 


Figure shows histograms of the estimated posterior probabilities of the alternative 
hypotheses based on 100 simulated data sets. For the cases (a), (c) and (d), when the 
corresponding Hqs are actually true, the method appropriately assigns values close 
to zero, whereas for the case (b), when Hq is actually false, the estimated posterior 
probabilities are very close to one. 


5 Applications 

In this section, we discuss two applications of the proposed conditional tensor fac¬ 
torization approach. In each case, we set the maximal possible order at g = 10. In 
experiments with higher values of g, the results remained practically unchanged. An 
additional application of the procedure described in Section |3.2| to test the order of 
serial dependence in a DNA sequence is presented in Section S^ of the Supplementary 
Materials, showing substantially improved results relative to competitors. These data 
sets have all been analyzed previously in the literature but our proposed nonparamet- 
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Figure 4: Results of simulation experiments for the case (G) with Cq = 3 categories, 
true important lags {yt-i,yt-i,yt-8} and sample size T = 500 for the data set with 
the median classihcation error rate, (a) trace plot of marginal likelihood p{y \ z); 

(b) trace plot of the number of important lags; (c) relative frequency distribution of 
the maximal order; (d) relative frequency distribution of the number of important 
lags; (e) inclusion proportions of different lags; (f) relative frequency distribution of 
the number of clusters of the probability kernels Xhi,...,hq', and (g) relative frequency 
distribution of 11^=1 f^e number of possible combinations of (hi,..., hq). Panels 

(c) -(g) are based on thinned samples after burn-in. See Section [2(2 and Section]^ for 
additional details. 


ric approach provides new insights into their serial dependence structures. Results 
produced by the VLMC and the RFMC methods for these data sets are deferred to 
Section |S.5| of the Supplementary Materials. 


5.1 Epileptic Seizure Data Set 


We hrst reanalyze a data set from 

Berchtold and Raftery 

(2002) (BR), originally 

presented in 

McDonald and Zucchini 

(1997 

). The data set comprises a binary time 


series describing whether a patient experienced epileptic seizures on 204 consecutive 
days. The two states correspond to either no epileptic seizure or at least one epileptic 
seizure. BR analyzed the data set using the MTD model and found that yt is best 
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Figure 5: Histograms of posterior probabilities of alternative hypotheses Hi based on 
100 simulated data sets for the case (G) with Gq = 3 categories, true important lags 
{yt-i,yt- 4 :,yt- 8 } and sample size T = 500. (a) Hq : the lag |/t _4 is important; (b) 
Ho : the lag yt-5 is important; (c) Hq : yts is an important lag but yt-o and yt-10 are 
not, that is, the chain is of maximal order 8; and (d) Hq : the only important lags are 
{yt-i,yt- 4 :,yt- 8 }- For fhe cases (a), (c) and (d), the corresponding Hq^s were actually 
true, whereas for the case (b), Hq was false. See Sections 3.2 and for additional 
details. 


explained by an MTD model with 8 lags with the lag yts being the most important 
one. The other important lags were yt-i,yt-A,yt -5 and yts- 

Figure [^summarizes the results obtained by our conditional tensor factorization 
approach. In agreement with BR, our analysis provides strong evidence for a Markov 
chain of maximal order 8 with yts being the most important lag with an inclusion 
probability of one. With an inclusion probability close to 0.85, yts was the second 
most important predictor. However, in contrast with BR, the distribution of the 
number of important lags was concentrated around 3 with the lags {yts,yt- 4 ,yt- 8 } 
appearing together the maximum number of times. In particular, the inclusion prob¬ 
abilities of yts and yts were very close to zero suggesting that these lags were not 
important predictors of yt- 


5.2 Song of the Wood Pewee Data Set 


Next, we reanalyze a data set from Raftery and Tavare (1994) (RT) that describes the 
morning song of the wood pewee, a North American song bird, comprising 3 distinct 
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Figure 6: Results for the seizure data set. (a) trace plot of marginal likelihood p{y \ z); 

(b) trace plot of the number of important lags; (c) relative frequency distribution of 
the maximal order; (d) relative frequency distribution of the number of important 
lags; (e) inclusion proportions of different lags; (f) relative frequency distribution of 
the number of clusters of the probability kernels Xhi,...,hq', and (g) relative frequency 
distribution of 11^=1 ^ji number of possible combinations of (hi,, hg). Panels 

(c) -(g) are based on thinned samples after burn-in. See Section for additional 
details. 


phrases, labeled 1, 2 and 3. An interesting feature of the data set is that the series is 
dominated by two repeating patterns, namely 1312 and 112. The repeating patterns 
indicate strong interactions among the lags and MTD models are not suitable for 
such data sets. As pointed out by RT, although the first pattern is of length 4, it 
can be specified by four transitions of order 2, namely 1|31, 3|12,1|21, 2|13. Likewise, 
the second repeating pattern 112 can be defined by the second order transitions 
1|12,1|21,2|11. The transition 1|21 and the conditioning sequence 12 appear in both 
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patterns. To accommodate these features, RT modeled the transition probabilities as 


p{yt = yt I yt-i = yt-i,yt-2 = yt-2) = < 



if {yt-i,yt-2} e Ah, 

{l-ah)y^ -- 

A {yt-i,yt-2} ^ Bh, 

^yt 

if {yt-i,yt-2} = {1,2}, 

'^yt 

otherwise. 

A 2 = {2|11}, Bh = {{yt 

1 yt-i,yt- 2 ) : {yt \ 


yt-i,yt-2) i Ah but there exists {yt \ yt-i,yt-2) with yt-i = yt-i and yt-2 = yt-2}, 
1 < tt/i < 1, 0 < TTy < 1 0 < 7 j, < = 1- The construction 

of such complex models with different parameterizations for different conditioning 
sequences requires critical understanding of the important features of the transition 
dynamics on a case by case basis and can not be easily generalized. 

Figure [^summarizes the results obtained by applying our conditional tensor fac¬ 
torization approach to the hrst 500 data points of the wood pewee data set. Panels 
(c), (d) and (e) of Figure indicate strong evidence of a Markov chain of maximal 
order 4 with three important lags {yt-i, yt-2, yt-^}- The inclusion proportions of these 
three lags were all close to one, whereas the inclusion proportion of the intermediate 
lag yt-3 was close to zero. This suggests that given {yt-i, yt-2, yt-4}, yt-3 carries little 
additional information useful for predicting yt. 

The results can be explained by first noting that a third order representation of the 
two repeating patterns 1312 and 112 comprise the transitions 1|312, 3|121,1|213,2|131 
and 1|121,1|211, 2|112, respectively, with the conditioning sequence 121 appearing 
in both sets of transitions. A fourth order representation, on the contrary, gives 
transitions with unique conditioning sequences, namely 1|3121, 3|1213,1|2131, 2|1312 
and 1|1211,1|2112, 2|1121, respectively. Also, if the third lag yts is dropped, we still 
obtain transitions with unique conditioning sequences, namely 1|31 ■ 1, 3|12 ■ 3,1|21 ■ 
1,2|13 ■ 2 and 1|12 ■ 1,1|21 ■ 2,2|11 ■ 1, respectively. It is thus clear that a Markov 
chain of maximal order 4 with three important lags {yt-i, yt-2, yt-4} would provide 
a good characterization of the transition dynamics of the wood pewee data set, as is 
captured by the proposed tensor factorization based approach. 


6 Discussion 

The proposed nonparametric Bayesian method provides a flexible yet parsimonious 
representation of higher order Markov chains, allowing automated identihcation of 
the set of important lags while also facilitating testing of many hypotheses of prac¬ 
tical interest. In simulation experiments, our method substantially out-performed 
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Figure 7: Results for the wood pewee data set. (a) trace plot of marginal likelihood 
p{y I z); (b) trace plot of the number of important lags; (c) relative frequency dis¬ 
tribution of the maximal order; (d) relative frequency distribution of the number of 
important lags; (e) inclusion proportions of different lags; (f) relative frequency dis¬ 
tribution of the number of clusters of the probability kernels ] and (g) relative 

frequency distribution of 11^=1 ^ji number of possible combinations of (hi,..., hg). 
Panels (c)-(g) are based on thinned samples after burn-in. See Section |^for additional 
details. 


competitors when there were gaps in the set of important lags, while performing 
competitively with existing methods in other cases. We have found that such gaps 
are commonplace in applications we have considered. 

While the focus of this paper has been on higher order homogeneous Markov 
models, the proposed methodology can be easily extended to nonhomogeneous cases 
in which the transition dynamics is also influenced by exogenous predictors. Indeed, 
our computer codes already accommodate sequentially varying categorical predictors. 
Multiple categorical sequences can also be easily accommodated with the sequence 
label treated as an exogenous sequence specihc categorical predictor. Additional 
important directions of ongoing research include extensions of the methodology to 
other discrete state space dynamical systems, including higher order hidden Markov 
models and models for spatial and spatio-temporal categorical data sets. 
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Supplementary Materials 

The Supplementary Materials present some additional figures, describe the approxi¬ 
mate two-stage sampler used to determine the starting values of the MCMC sampler, 
discuss MCMC diagnostics, present an additional application of the proposed method¬ 
ology, and summarize the results produced by the VLMC and the RFMC methods 
for the data sets discussed in Section [H 

Appendix 

Appendix A Proof of Theorem 1 

Consider a Markov chain {?/*} of maximal order q with finite state space y and 
transition probability tensor P. Then {st = {yt, ■ ■ ■, yt-q+i)"^} is a first-order Markov 
chain with state space S = and transition probability matrix P with entries 

■ ■ ■ ijt—q)i {}ti ■ ■ ■ Pt—q+l)] 

Pr[(j/i . . . , y^_q^i I {yt—1 jt—li ■ ■ ■ 1 yt—q jt—q)\ 

{ Piyt = it I yt-i = jt-i, ■ ■ ■, yt-q = jt-q), if k-i = jt-e ior 
0, otherwise. 

The following lemma establishes a general posterior consistency result for first order 
Markov chain models. With P and Pq denoting the transition probability matrices 
of the first order representations of two Markov chains with respective transition 
probability tensors P and Pq, we have d{P,Po) = Xlsi X]s 2 l-P(®i)® 2 ) — Tb(si,S 2 )|- 
The conclusion of Theorem thus follows as a consequence. 

Lemma 1. Let {yt} be an ergodic Markov chain with finite state space and transition 
probability matrix P E V. Let U be a prior on V . Then for any Pq in the Kullback- 
Leibler support ofU and any 5 > 0, fl [P : d{P, Pq) > 6 \ yi:T} —t 0 a.s. Pq. 

Proof. Let y denote the state space of {yt}- Since {yt} is ergodic, it has a unique 
stationary distribution ttq, with 7io{j) > 0 for any j E y. We define the empiri¬ 
cal stationary distribution as ^tU) = Ylt=i^{yt — j}/P = j ^ 3^- 

Likewise, for any i,j E y, we define the empirical transition probability matrix as 
PriLf) = Er=i HVt-i = hVt = j}/Ef=i = i} = Wj/ny Define 4(,(P,Po) = 
EiEj7ro(i)|P(Zj)-Po(bj)l andK^q{P,Po) = E* Ej ^o(*)Po(b= K{ttqP,ttqPo). 
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Also let V = {P : Pq) > ^miiij 7ro(i)}. Then {P : d{P,Po) > 6} <TV. We have 


n {P : d{P, Po)>S\ yi:r} < | 


fyexp 

J^exp 




PodJ) ] 
P(iJ) J 

PodJ) ] 
P(iJ) J 


dn(P) 

dn(P) 

(A.l) 


By ergodic theorem, ttt and Pt converge almost surely to ttq and Pq, respectively 


(Eichelsbacher and Ganesh, 2002). Therefore, for the numerator, we have 


i j 

= lim y^y^{ni/T){nij/ni)log{{nij/ni)/P{i,j)}= lim K{^tPt,^tP)- 

T—^■oo ^ ^ ^ ^ T^oo 

^ j 


For any P eV and T sufficiently large, we have 


P(^tPt,^tP) >4^(Pr,P)/4 

> {rf^^(P,Po) - d5f^(PT,Po)}V4 > {26/3 - 5/3)V4 = 6^36 a.s. Pq. 


Therefore, with ft < 5‘^/3Q, we have 


lim exp(/?T) 

T-s-oo 


exp 


'V 


V—^ V—^ Hi rii j Pn(i, ?) 1 

ijTm P{t,j) J 


dU{P) 


< lim exp{(/9 — 6‘^/36)T} = 0 a.s. Pq. 

T->-oo 


(A.2) 


Similarly, for the denominator, we have — - 'f2j{ni/T){nij/ni)log{Po{i, j)/P{i, j)} —>■ 

— P^(,(Po,P) > — e a.s. Pq for any P with P^(,(Po,P) < e. For any ft > 0, choosing 
e = ft /2, using Fatou’s lemma we have 


exp(/3T) 




dIl{P) —)■ oo a.s. Pq. (A. 3) 


The proof follows combining (A.l), (A.2) and (A.S). 
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Appendix B Collapsed Conditional of k 


The two key steps in deriving the collapsed conditional of k in Section 3T were 
(a) to obtain a closed form expression for p{kj \ Zj,Wj) and then (b) to show that 
p(k I y, z, z*, A*, 77*) = Y['j=iPi^j I (b) follows easily by noting that the 

Markov blanket of kj, after TTk are integrated ont, comprises precisely Zj and Wj. We 
provide the technical details of the first step here. We nse the generic po to denote 
priors and hyper-priors. First, we note that integrating ont gives 


( \ f ( \ 0) 1 \ f 0) TT / + nj,r{i)} 

p{zj I Wj,kj) = / p(z,- I w,-,7r)J^/c,)po(<;)d7r)f/ = [[ ’ 


;Li 1 + nj,r) 


Co r ^ maxz^ 

n 1 n 


(%',r(b) 


Co 


n n r 

. r=l l=\ 


K'.rW) 





, (A.4) 


where = x{x -|- 1)... (x -1- m — 1) with x*^®^ = 1, Zj^r = {zj,t ■ Wj,t = r}, Uj^r = 


J2t=t* denotes the freqnency of the category of the predictor Wj 

and Uj^ri^) = J2't=t* ~ ~ denotes the nnmber of allocation variables 

that are associated with the category of the predictor and are instantiated at 
i. Also, since p{kj \ Wj) = po,j{kj), we have 


Co 

P(zi I Wj) = p{zj I Wj, kj)poj{kj) 

kj ihs.Xt’ , T } 

with = E^)'L^Po,i(A;i)n2i{r(^i7i)/r(fci7j +%.r)}- This yields a 

closed form expression for p{kj \ Zj,Wj) as 

kj = maxzj,..., Cq. 

(A.5) 


p{kj I Wj) p{zj I Wj, kj) _ Po,j{kj) n2i rcfc.V+n,,,) 


p{kj 


Zj,Wj^ 


p(Zj 


w,- 


U, 




max 


Zj) 


Cq max Zj 

n n 

.r=l 1=1 


7 




u, 






This completes the derivation. 
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Figure S.l: Induced prior probabilities of different lags to be included in the model {kj > 1, 
left panels) and the total number of important lags > I); right panels) for 

the proposed conditional tensor factorization based Markov model with Uq = 4 states and 


maximal order g = 10 for various values of p under the prior (16) of the main paper 
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S.2 Approximate Sampler 




Figure S.2: Graphical model depicting the dependency structure in a second order Markov 
chain {yt} for time point t. (a) The proposed model implementing soft clustering of 
~ with ~ Dir( 7 j-,..., 7 j) for all and ~ 

independently for all (hi,..., hq). (b) An approximation of the proposed model implement¬ 


ing hard clustering of Zj^t 


77 


(i), 




i) with ^ and 


^hi,...,hg ~ Dir(Q;,..., a) independently for all (hi,..., hq). This approximation forms the 
basis of the approximate sampler described in Section S.2 of the Supplementary Materials. 


This section describes the approximate sampler used to determine the starting values of 
the latent class allocation variables z for the MCMC sampler described in Section 3.1 of the 
main paper. 

Given a model indexed by k = {hi,. .., kq}, the levels of Wj are partitioned into kj clusters 
{Cj^r ■ r = 1, ... ,kj} with each cluster Cj^r assumed to correspond to its own latent class 
hj = r. With independent Dirichlet priors on the mixture kernels \hi,...,hq ~ Dir(Q;, ...,«) 
marginalized out, the likelihood conditional on the cluster conhgurations C = {Cj^r '■ j = 
1,... ,g,r = 1,... ,hj} is given by 


p(y I C) 


n 


(5{a + ..., g -F nfe,,...,^(Go)} 

/3(q:, ...,«) 


(S.l) 


where nh^^...,hg{y) = Y7t=t* e .. .,Wq^t e Cm,hg}- Given the current model 

indexed by k = (hi,..., kq} and clusters C = {Cj^r ■ j = 1, ■ ■ ■, = 1) ■ ■ •) we do the 

following for j = 1,... ,q. 

















SUPPLEMENTARY MATERIALS 


S.3 


1. If kj < Uo, we propose to increase kj to {kj + 1). If kj > 1, we propose to decrease 
kj to {kj — 1). For 1 < kj < Cq, the moves are proposed with equal probability. For 
kj = 1, the increase move is selected with probability 1. For kj = Uq, the decrease 
move is selected with probability 1. 

2. If an increase move is proposed, we randomly split a cluster of Wj into two clusters. We 
accept this move with acceptance rate based on the approximated marginal likelihood. 

3. If a decrease move is proposed, we randomly merge two clusters of Wj into a single 
cluster. We accept this move with acceptance rate based on the approximated marginal 
likelihood. 

The latent class allocation variables z are initialized at the cluster allocation variables re¬ 
turned by the approximate sampler after 100 iterations. 


S.3 MCMC Diagnostics 


Figure 0 shows some additional MCMC diagnostics (Cowes and Carlin, 1996; Flegal and 
Jones, 2011) based on thinned samples for the case (G) with Co = 3 categories, true im¬ 
portant lags {yt-i,yt-A,yt-8} and sample size T = 500 for the data set with the median 
classihcation error rate in the simulation experiments. Our model accommodates uncer¬ 
tainty in the set of important lags. This set may vary from one MCMC iteration to another. 
To draw the trace plot for p{y \ accommodating variable lag sets, where { 11 , 14 ,, is] 

denotes a specihc value of {yt-i,yt- 4 ,yt- 8 }, we hrst identihed a to from {(T -|- 1),... ,N] 
such that {ytQ-i,yto- 4 ,ytQ- 8 ] = {CG 4 G 8 }- The trace plot for p{y \ 14 , 14 , 1 ^) is then based on 
estimates of p{y \ ytQ-i,ytQ- 2 , ■ ■ ■ ,yto-io) for different MCMC iterations. Our experiments 
with different with same {ytQ-i,ytQ- 4 ,yto- 8 } = {^,* 408 } produced very similar results. 
As Figure [S]^ shows, the running means and quantiles are very stable, Monte Carlo standard 
errors were small, and there is good agreement between the truth and the running posterior 


means. 


Figure S.4 shows similar diagnostic plots for the MCMC output for the wood pewee data 


set analyzed in Section |5.2| of the main paper. In this case the truth is unknown. Following 

we assumed {l/t-i, 2 /t- 2 , 2 / 4 - 4 } to be the set of important lags. 

Vto—i, ■ ■ ■, 2 /to—10) tor 


5.2 


the discussion in Section 
The trace plot for p(y | ii,i 2 ,i 4 ) is thus based on the estimates of p(y 


different MCMC iterations for some to from {T -|- 1,..., N] such that {yto-i, yto- 2 , yto- 4 } = 
{^ 15 * 25 * 4 }- The running means and quantiles are again very stable with small Monte Carlo 
standard errors and the estimated posterior means agree well with our empirical expectations. 

In all examples, the quantiles can be used to construct 90% posterior probability regions. 
Since the transition probabilities have variances uniformly bounded above by 1/4, Monte 
Carlo standard errors in the hnal posterior mean estimates have a conservative uniform 
upper bound of 1/(2a/8000) ^ 0.0056. 
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Figure S.3: Trace plots of some transition probabilities for the case (G) with Cq = 3 cate¬ 
gories, true important lags {yt-i,yt- 4 :,yt- 8 } and sample size T = 500 for the data set with 
the median classihcation error rate in the simulation experiments. In each panel, the solid 
blue line shows the running mean and the horizontal dashed blue line shows the correspond¬ 
ing true value. The darker green lines show the 5% and 95% running quantiles. The number 
at the middle of each panel shows the Monte Carlo standard error estimated by batch means 
analysis with batch length 100. See Section of the main paper and Section S^ of the 
Supplementary Materials for additional details. 


S.4 Analysis of Human Preproglucagon Gene Data Set 


In this section, we present an analysis of a DNA sequence found in the human preproglucagon 


gene (Bell et a/.,1983). There are 1752 data points and four states A, C, G and T. Avery 


and Henderson (1999) and Besag and Mondal (2013) analyzed the data set focusing their 
attention on Markov models of up to third order, using asymptotic tests and simulation 
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Figure S.4: Trace plots of some transition probabilities for the wood pewee data set. In each 
panel, the solid blue line shows the running mean. The darker green lines show the 5% and 
95% running quantiles. The number at the middle of each panel shows the Monte Carlo 
standard error estimated by batch means analysis with batch length 100. See Section a of 


the main paper and Section S.3 of the Supplementary Materials for additional details. 


based exact tests, respectively, to assess £t. The test of a hrst order Markov chain against 
a second order alternative led to rejection of the null hypothesis at the level 0.019, and the 
test of a second order null versus a third order alternative produced a p-value of 0.34, whereas 
the corresponding simulation based tests produced p-values of 0.028 and 0.44, respectively, 
providing evidence that a second order model gives the best £t to the data set among the 
candidate models. 

Figure |S.5 summarizes the results produced by the proposed conditional tensor factor¬ 
ization approach with maximal order q = 3 applied to the hrst 1000 data points. Due to 
minor mixing issues, in this case we ran the MCMC algorithm for 5 million iterations (to be 
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conservative) with the initial 2 millions discard as bnrn-in. The estimated posterior prob¬ 
abilities of hrst, second and third order Markov models were approximately 0 (the MCMC 
chain never visited this model), 0.43 and 0.57, respectively. With a posterior odds of oo for 
a second order model against a hrst order model and a posterior odds of 1.33 for a third 
order model against a second order model, the results were in general agreement with the 
frequentist analyses. 

The proposed conditional tensor factorization based approach enables us to test for serial 
dependencies of much higher orders. With the maximal order set at g = 10, the posterior 
probability of the model being of maximal order 7 was estimated to be approximately 0.94. 
The MCMC algorithm never visited a Markov model of maximal order 2. Figure S.6 sum¬ 
marizes the results. Markov models are widely used for nucleotide sequences, and hence 
the ability to ht more realistic models containing higher order dependence is of substantial 
importance in this application area. 



Figure S.5: Results for the human preproglucagon gene data set with the maximal order set 
at g = 3. (a) trace plot of marginal likelihood p{y \ z); (b) trace plot of the number of im¬ 
portant lags; (c) relative frequency distribution of the maximal order; (d) relative frequency 
distribution of the number of important lags; (e) inclusion proportions of different lags; (f) 
relative frequency distribution of the number of clusters of the probability kernels Xhi,...,h ; 
and (g) relative frequency distribution of 111=1 number of possible combinations of 

(hi,..., hq). Panels (c)-(g) are based on thinned samples after burn-in. 
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Figure S.6: Results for the human preproglucagon gene data set with the maximal order set 
at g = 10. (a) trace plot of marginal likelihood p(y | z); (b) trace plot of the number of im¬ 
portant lags; (c) relative frequency distribution of the maximal order; (d) relative frequency 
distribution of the number of important lags; (e) inclusion proportions of different lags; (f) 
relative frequency distribution of the number of clusters of the probability kernels Xhi,...,hg] 
and (g) relative frequency distribution of YYj=i ^^e number of possible combinations of 
(hi,..., hq). Panels (c)-(g) are based on thinned samples after burn-in. 


S.5 Results Produced by VLMC and RFMC 


Figure |S.7 shows the context trees (see Machler and Biihlmann, 2004) summarizing the 


serial dependence structures and transition probabilities estimated by the VLMC method, 
as implemented by the VLMC package in R, applied to the real data sets discussed in Section 
l^of the main paper and Section ST of the Supplementary Materials. The pruning parameter 
was selected by AIC criterion. 

Figure [Sl^ shows the relative importance of different lags estimated by the RFMC method, 
as implemented by the radomForest package in R, applied to the real data sets discussed in 
Section of the main paper and Section S.4| of the Supplementary Materials. 

For the epileptic seizure data set, the wood pewee data set and the human gene data 
set, the VLMC method estimated Markov chains of maximal orders 9, 4 and 4, respectively. 
For the seizure data set, the entire sequence consisted of only 204 data points. With the 
hrst 200 data points used to £t the models, there were not enough additional observations 
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to evaluate prediction performances. For the wood pewee data set and the human gene 
data set, we used the hrst 500 and 1000 data points to £t the models and the following 
500 observations to evaluate one-step ahead prediction performances. For the wood pewee 
data set, classification error rates for our proposed conditional tensor factorization based 
approach, VLMC and RFMC were 0.021, 0.024 and 0.026, respectively. For the human gene 
data set, classihcation error rates for our proposed conditional tensor factorization based 
approach, VLMC and RFMC were 0.64, 0.66 and 0.66, respectively. 

While the maximal orders and the classihcation error rates estimated by the two com¬ 
peting methods are in general agreement, the VLMC method, with a top-down tree based 
mechanism to model serial dependencies, could not detect gaps in the set of important lags 
for the hrst two data sets, which were suggested by the proposed conditional tensor fac¬ 
torization approach. Such gaps seem to be a common feature of many data sets, which is 
commonly obscured by existing statistical methods. 
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Figure S.7: Context trees produced by the VLMC method for the data sets discussed in 
Section 1^ of the main paper and Section S.4 of the Supplementary Materials. 
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Figure S.8: Plots showing the relative importance of different lags as estimated by the RFMC 
method for the data sets discussed in Section of the main paper and Section ST of the 
Supplementary Materials. 
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